Ordinary least squares

In statistics, ordinary least squares (OLS) or linear least squares is a method for estimating the unknown parameters in a linear regression model. This method minimizes the sum of squared vertical distances between the observed responses in the dataset and the responses predicted by the linear approximation. The resulting estimator can be expressed by a simple formula, especially in the case of a single regressor on the right-hand side.

The OLS estimator is consistent when the regressors are exogenous and there is no multicollinearity, and optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances. Under the additional assumption that the errors be normally distributed, OLS is the maximum likelihood estimator. OLS is used in economics (econometrics) and electrical engineering (control theory and signal processing), among many areas of application.

Linear model

Suppose the data consists of n observations {yi, xi}, i = 1, …, n. Each observation includes a scalar response yi and a vector of predictors (or regressors) xi. In a linear regression model the response variable is a linear function of the regressors:


    y_i = x'_i\beta + \varepsilon_i, \,

where β is a p×1 vector of unknown parameters; the εi are unobserved scalar random variables (errors) which account for the discrepancy between the actually observed responses yi and the "predicted outcomes" x′iβ; and ′ denotes matrix transpose, so that x′β is the dot product between the vectors x and β. This model can also be written in matrix notation as


    y = X\beta + \varepsilon, \,

where y and ε are n×1 vectors, and X is an n×p matrix of regressors, which is also sometimes called the design matrix.

As a rule, the constant term is always included in the set of regressors X, say, by taking xi1 = 1 for all i = 1, …, n. The coefficient β1 corresponding to this regressor is called the intercept.

There may be some relationship between the regressors. For instance, the third regressor may be the square of the second regressor. In this case (assuming that the first regressor is constant) we have a quadratic model in the second regressor. But this is still considered a linear model because it is linear in the βs.

Assumptions

There are several different frameworks in which the linear regression model can be cast in order to make the OLS technique applicable. Each of these settings produces the same formulas and the same results; the only difference is the interpretation and the assumptions which have to be imposed in order for the method to give meaningful results. The choice of the applicable framework depends mostly on the nature of the data at hand, and on the inference task which has to be performed.

One of the lines of difference in interpretation is whether to treat the regressors as random variables, or as predefined constants. In the first case (random design) the regressors xi are random and sampled together with the yi from some population, as in an observational study. This approach allows for more natural study of the asymptotic properties of the estimators. In the other interpretation (fixed design), the regressors X are treated as known constants set by a design, and y is sampled conditionally on the values of X as in an experiment. For practical purposes, this distinction is often unimportant, since estimation and inference are carried out while conditioning on X. All results stated in this article are within the random design framework.

Classical linear regression model

The classical model focuses on the "finite sample" estimation and inference, meaning that the number of observations n is fixed. This contrasts with the other approaches, which study the asymptotic behavior of OLS, and in which the number of observations is allowed to grow to infinity. Some results within this framework can be derived only under the assumption of normally distributed error terms, an assumption which is frequently criticized in practical applications.

I.i.d. specification

In some applications, especially with cross-sectional data, an additional assumption is imposed — that all observations are independent and identically distributed (iid). This means that all observations are taken from a random sample which makes all the assumptions listed earlier simpler and easier to interpret. Also this framework allows one to state asymptotic results (as the sample size n → ∞), which are understood as a theoretical possibility of fetching new independent observations from the data generating process. The list of assumptions in this case is:

Time series model

Estimation

Suppose b is a "candidate" value for the parameter β. The quantity yi − xi′b is called the residual for the i-th observation; it measures the vertical distance between the data point (xi, yi) and the hyperplane y = x′b, and thus assesses the degree of fit between the actual data and the model. The sum of squared residuals (SSR) (also called the error sum of squares (ESS) or residual sum of squares (RSS))[5] is a measure of the overall model fit:


    S(b) = \sum_{i=1}^n (y_i - x'_ib)^2 = (y-Xb)'(y-Xb).

The value of b which minimizes this sum is called the OLS estimator for β. The function S(b) is quadratic in b with positive-definite Hessian, and therefore this function possesses a unique global minimum, which can be given by an explicit formula:[6]


    \hat\beta = {\rm arg}\min_{b\in\mathbb{R}^p} S(b) =  \bigg(\frac{1}{n}\sum_{i=1}^n x_ix'_i\bigg)^{\!-1} \!\!\cdot\, \frac{1}{n}\sum_{i=1}^n x_iy_i = (X'X)^{-1}X'y\ .
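The closed-form expression above can be checked numerically. The following is a minimal sketch on synthetic data (the data, seed, and variable names are assumptions made for illustration, not part of the article); it solves the normal equations directly and compares the result with NumPy's general least-squares solver.

```python
import numpy as np

# Hypothetical synthetic data: a constant plus two random regressors.
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Closed form: solve the normal equations (X'X) b = X'y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same estimate from a general least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)


def S(b):
    """Sum of squared residuals for a candidate parameter vector b."""
    r = y - X @ b
    return r @ r


print(beta_normal, S(beta_normal))
```

In practice a solver such as `lstsq` is usually preferred over forming (X′X)⁻¹ explicitly, since it behaves better when the regressors are nearly collinear.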

After we have estimated β, the fitted values (or predicted values) from the regression will be


    \hat{y} = X\hat\beta = Py,

where P = X(X′X)⁻¹X′ is the projection matrix onto the space spanned by the columns of X. This matrix P is also sometimes called the hat matrix because it "puts a hat" onto the variable y. Another matrix, closely related to P, is the annihilator matrix M = In − P; this is a projection matrix onto the space orthogonal to the columns of X. Both matrices P and M are symmetric and idempotent (meaning that P² = P and M² = M), and relate to the data matrix X via the identities PX = X and MX = 0.[7] Matrix M creates the residuals from the regression:


    \hat\varepsilon = y - X\hat\beta = My = M\varepsilon.
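The stated properties of P and M are easy to verify numerically. This is a small sketch on hypothetical random data (all names and values are illustrative assumptions):

```python
import numpy as np

# Hypothetical small design: a constant and one random regressor.
rng = np.random.default_rng(1)
n, p = 8, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix X (X'X)^{-1} X'
M = np.eye(n) - P                      # annihilator matrix I_n - P

y_hat = P @ y   # fitted values
resid = M @ y   # residuals

print(np.trace(P))  # the trace of P equals the number of regressors p
```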

Using these residuals we can estimate the value of σ²:


    s^2 = \frac{\hat\varepsilon'\hat\varepsilon}{n-p} = \frac{y'My}{n-p} = \frac{S(\hat\beta)}{n-p},\qquad
    \hat\sigma^2 = \frac{n-p}{n}\;s^2

The denominator, n − p, is the statistical degrees of freedom. The first quantity, s², is the OLS estimate for σ², whereas the second, \hat\sigma^2, is the MLE estimate for σ². The two estimators are quite similar in large samples; the first one is unbiased, while the second is biased downward by the factor (n − p)/n. In practice s² is used more often, since it is more convenient for hypothesis testing. The square root of s² is called the standard error of the regression (SER), or standard error of the equation (SEE).[7]

It is common to assess the goodness-of-fit of the OLS regression by comparing how much the initial variation in the sample can be reduced by regressing onto X. The coefficient of determination R² is defined as a ratio of "explained" variance to the "total" variance of the dependent variable y:[8]


    R^2 = \frac{\sum(\hat y_i-\overline{y})^2}{\sum(y_i-\overline{y})^2} = \frac{y'LPy}{y'Ly} = 1 - \frac{y'My}{y'Ly} = 1 - \frac{\rm SSR}{\rm TSS}

where TSS is the total sum of squares for the dependent variable, L = In − 11′/n, and 1 is an n×1 vector of ones. (L is a "centering matrix" which is equivalent to regression on a constant; it simply subtracts the mean from a variable.) In order for R² to be meaningful, the matrix X of data on regressors must contain a column vector of ones to represent the constant whose coefficient is the regression intercept. In that case, R² will always be a number between 0 and 1, with values close to 1 indicating a good degree of fit.
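As an illustration, the equivalent expressions for R² can be compared on a small simulated sample. The sketch below uses made-up data and assumes X contains a constant column; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical synthetic data with an intercept.
rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 0.5 + 1.5 * x + rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
M = np.eye(n) - P                       # annihilator matrix
L = np.eye(n) - np.ones((n, n)) / n     # centering matrix I_n - 11'/n

tss = y @ L @ y                         # total sum of squares
ssr = y @ M @ y                         # residual sum of squares

r2_explained = (y @ L @ P @ y) / tss    # y'LPy / y'Ly
r2_residual = 1.0 - ssr / tss           # 1 - SSR/TSS

# Direct definition: explained variation over total variation.
y_hat = P @ y
r2_direct = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

print(r2_explained, r2_residual, r2_direct)
```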

Simple regression model

If the data matrix X contains only two variables, a constant and a scalar regressor xi, then this is called the "simple regression model".[9] This case is often considered in beginner statistics classes, as it provides much simpler formulas that are even suitable for manual calculation. The vector of parameters in such a model is 2-dimensional, and is commonly denoted as (α, β):


    y_i = \alpha + \beta x_i + \varepsilon_i.

The least squares estimates in this case are given by simple formulas


    \hat\beta = \frac{ \sum{x_iy_i} - \frac{1}{n}\sum{x_i}\sum{y_i} }
                     { \sum{x_i^2} - \frac{1}{n}(\sum{x_i})^2 } = \frac{ \mathrm{Cov}[x,y] }{ \mathrm{Var}[x] } , \quad
    \hat\alpha = \overline{y} - \hat\beta\,\overline{x}\ .
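These formulas need nothing beyond basic arithmetic. A minimal sketch in plain Python follows; the function name `simple_ols` and the four data points are illustrative assumptions, chosen so that the points lie exactly on the line y = 1 + 2x.

```python
def simple_ols(xs, ys):
    """Return (alpha, beta) for the simple regression y = alpha + beta * x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    beta = (sxy - sx * sy / n) / (sxx - sx * sx / n)
    alpha = sy / n - beta * sx / n
    return alpha, beta


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
alpha, beta = simple_ols(xs, ys)
print(alpha, beta)  # 1.0 2.0
```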

Alternative derivations

In the previous section the least squares estimator \hat\beta was obtained as a value that minimizes the sum of squared residuals of the model. However it is also possible to derive the same estimator from other approaches. In all cases the formula for the OLS estimator remains the same: \hat\beta = (X′X)⁻¹X′y; the only difference is in how we interpret this result.

Geometric approach

For mathematicians, OLS is an approximate solution to an overdetermined system of linear equations Xβ ≈ y, where β is the unknown. Assuming the system cannot be solved exactly (the number of equations n is much larger than the number of unknowns p), we are looking for a solution that provides the smallest discrepancy between the right- and left-hand sides. In other words, we are looking for the solution that satisfies


    \hat\beta = {\rm arg}\min_\beta\,\lVert y - X\beta \rVert,

where ||·|| is the standard L² norm in the n-dimensional Euclidean space Rⁿ. The predicted quantity Xβ is just a certain linear combination of the vectors of regressors. Thus, the residual vector y − Xβ will have the smallest length when y is projected orthogonally onto the linear subspace spanned by the columns of X. The OLS estimator \hat\beta in this case can be interpreted as the coefficients of vector decomposition of \hat{y} = Py along the basis of X.

Maximum likelihood

The OLS estimator is identical to the maximum likelihood estimator under the normality assumption for the error terms.[10] This normality assumption has historical importance, as it provided the basis for the early work in linear regression analysis by Yule and Pearson. From the properties of MLE, we can infer that the OLS estimator is asymptotically efficient (in the sense of attaining the Cramér–Rao bound for variance) if the normality assumption is satisfied.[11]

Generalized method of moments

In the iid case the OLS estimator can also be viewed as a GMM estimator arising from the moment conditions


    \mathrm{E}\big[\, x_i(y_i - x_i'\beta) \,\big] = 0.

These moment conditions state that the regressors should be uncorrelated with the errors. Since xi is a p-vector, the number of moment conditions is equal to the dimension of the parameter vector β, and thus the system is exactly identified. This is the so-called classical GMM case, when the estimator does not depend on the choice of the weighting matrix.

Note that the original strict exogeneity assumption E[εi | xi] = 0 implies a far richer set of moment conditions than stated above. In particular, this assumption implies that for any vector-function ƒ, the moment condition E[ƒ(xi)·εi] = 0 will hold. However it can be shown using the Gauss–Markov theorem that the optimal choice of function ƒ is to take ƒ(x) = x, which results in the moment equation posted above.

Finite sample properties

First of all, under the strict exogeneity assumption the OLS estimators \hat\beta and s² are unbiased, meaning that their expected values coincide with the true values of the parameters:[12]


    \operatorname{E}[\, \hat\beta \,| X \,] = \beta, \quad \operatorname{E}[\,s^2\,|X\,] = \sigma^2.

If the strict exogeneity does not hold (as is the case with many time series models, where exogeneity is assumed only with respect to the past shocks but not the future ones), then these estimators will be biased in finite samples.

The variance-covariance matrix of \hat\beta is equal to [13]


    \operatorname{Var}[\, \hat\beta \,| X \,] = \sigma^2(X'X)^{-1}.

In particular, the standard error of each coefficient \hat\beta_j is equal to the square root of the j-th diagonal element of this matrix. The estimate of this standard error is obtained by replacing the unknown quantity σ² with its estimate s². Thus,


    \widehat{\operatorname{s.\!e}}(\hat{\beta}_j) = \sqrt{s^2 (X'X)^{-1}_{jj}}
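The estimated standard errors follow directly from s² and the diagonal of (X′X)⁻¹. The sketch below uses simulated data; the sample size, coefficients, and variable names are assumptions made only for illustration.

```python
import numpy as np

# Hypothetical synthetic data: a constant plus two regressors.
rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

s2 = resid @ resid / (n - p)        # estimate of sigma^2 with n - p in the denominator
sigma2_mle = resid @ resid / n      # MLE, equal to (n - p)/n * s2

cov_hat = s2 * XtX_inv              # estimated Var[beta_hat | X]
std_err = np.sqrt(np.diag(cov_hat)) # standard error of each coefficient
t_stats = beta_hat / std_err        # t-statistics for H0: beta_j = 0

print(std_err, t_stats)
```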

It can also be easily shown that the estimator \hat\beta is uncorrelated with the residuals from the model:[13]


    \operatorname{Cov}[\, \hat\beta,\hat\varepsilon \,|X\,] = 0.

The Gauss–Markov theorem states that under the spherical errors assumption (that is, the errors should be uncorrelated and homoscedastic) the estimator \hat\beta is efficient in the class of linear unbiased estimators. This is called the best linear unbiased estimator (BLUE). Efficiency should be understood as follows: if we were to find some other estimator \tilde\beta which is linear in y and unbiased, then [13]


    \operatorname{Var}[\, \tilde\beta \,| X \,] - \operatorname{Var}[\, \hat\beta \,| X \,] \geq 0

in the sense that this is a nonnegative-definite matrix. This theorem establishes optimality only in the class of linear unbiased estimators, which is quite restrictive. Depending on the distribution of the error terms ε, other, non-linear estimators may provide better results than OLS.

Assuming normality

The properties listed so far are all valid regardless of the underlying distribution of the error terms. However if you are willing to assume that the normality assumption holds (that is, that ε ~ N(0, σ²In)), then additional properties of the OLS estimators can be stated.

The estimator \hat\beta is normally distributed, with mean and variance as given before:[14]


    \hat\beta\ \sim\ \mathcal{N}\big(\beta,\ \sigma^2(X'X)^{-1}\big)

This estimator reaches the Cramér–Rao bound for the model, and thus is optimal in the class of all unbiased estimators.[11] Note that unlike the Gauss–Markov theorem, this result establishes optimality among both linear and non-linear estimators, but only in the case of normally distributed error terms.

The estimator s² is proportional to a chi-squared random variable:[15]


    s^2\ \sim\ \frac{\sigma^2}{n-p} \cdot \chi^2_{n-p}

The variance of this estimator is equal to 2σ⁴/(n − p), which does not attain the Cramér–Rao bound of 2σ⁴/n. However it was shown that there are no unbiased estimators of σ² with variance smaller than that of the estimator s².[16] If we are willing to allow biased estimators, and consider the class of estimators that are proportional to the sum of squared residuals (SSR) of the model, then the best (in the sense of the mean squared error) estimator in this class will be \tilde\sigma^2 = SSR / (n − p + 2), which even beats the Cramér–Rao bound in the case when there is only one regressor (p = 1).[17]

Moreover, the estimators \hat\beta and s² are independent,[18] a fact which comes in useful when constructing the t- and F-tests for the regression.

Influential observations

As was mentioned before, the estimator \hat\beta is linear in y, meaning that it represents a linear combination of the dependent variables yi. The weights in this linear combination are functions of the regressors X, and generally are unequal. The observations with high weights are called influential because they have a more pronounced effect on the value of the estimator.

To analyze which observations are influential we remove a specific j-th observation and consider how much the estimated quantities are going to change (similarly to the jackknife method). It can be shown that the change in the OLS estimator for β will be equal to [19]


    \hat\beta^{(j)} - \hat\beta = - \frac{1}{1-h_j} (X'X)^{-1}x_j\hat\varepsilon_j\,,

where hj = xj′ (X′X)−1xj is the j-th diagonal element of the hat matrix P, and xj is the vector of regressors corresponding to the j-th observation. Similarly, the change in the predicted value for j-th observation resulting from omitting that observation from the dataset will be equal to [19]


    \hat{y}_j^{(j)} - \hat{y}_j = x'_j\hat\beta^{(j)} - x'_j\hat\beta = - \frac{h_j}{1-h_j}\,\hat\varepsilon_j

From the properties of the hat matrix, 0 ≤ hj ≤ 1, and they sum up to p, so that on average hj ≈ p/n. These quantities hj are called the leverages, and observations with high hj are called leverage points.[20] Usually the observations with high leverage ought to be scrutinized more carefully, in case they are erroneous, or outliers, or in some other way atypical of the rest of the dataset.
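The leave-one-out identities above can be confirmed by actually refitting the model without one observation. The following sketch does so on hypothetical simulated data (the choice j = 0 and all names are illustrative assumptions):

```python
import numpy as np

# Hypothetical synthetic data: a constant and one regressor.
rng = np.random.default_rng(4)
n, p = 12, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Leverages h_j = x_j' (X'X)^{-1} x_j, the diagonal of the hat matrix.
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# Refit without observation j and compare with the closed-form change.
j = 0
keep = np.arange(n) != j
beta_loo, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
delta_formula = -(XtX_inv @ X[j]) * resid[j] / (1.0 - h[j])

# Change in the predicted value for observation j.
pred_change = X[j] @ beta_loo - X[j] @ beta_hat

print(h.sum(), beta_loo - beta_hat, delta_formula)
```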

Partitioned regression

Sometimes the variables and corresponding parameters in the regression can be logically split into two groups, so that the regression takes form


    y = X_1\beta_1 + X_2\beta_2 + \varepsilon,

where X1 and X2 have dimensions n×p1, n×p2, and β1, β2 are p1×1 and p2×1 vectors, with p1 + p2 = p.

The Frisch–Waugh–Lovell theorem states that in this regression the residuals \hat\varepsilon and the OLS estimate \hat\beta_2 will be numerically identical to the residuals and the OLS estimate for β2 in the following regression:[21]


    M_1y = M_1X_2\beta_2 + \eta\,,

where M1 is the annihilator matrix for regressors X1.

The theorem can be used to establish a number of theoretical results. For example, having a regression with a constant and another regressor is equivalent to subtracting the means from the dependent variable and the regressor and then running the regression for the demeaned variables but without the constant term.
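A numerical check of the Frisch–Waugh–Lovell theorem might look as follows. This is a sketch on simulated data; the partition into X1 and X2 and all coefficient values are hypothetical.

```python
import numpy as np

# Hypothetical synthetic data, with X2 deliberately correlated with X1.
rng = np.random.default_rng(5)
n = 60
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, 2)) + 0.5 * X1[:, [1]]
y = X1 @ np.array([1.0, 2.0]) + X2 @ np.array([-1.0, 0.5]) + rng.normal(size=n)

# Full regression of y on [X1, X2].
X = np.hstack([X1, X2])
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
resid_full = y - X @ beta_full

# Partialled-out regression of M1 y on M1 X2.
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
beta2_fwl, *_ = np.linalg.lstsq(M1 @ X2, M1 @ y, rcond=None)
resid_fwl = M1 @ y - M1 @ X2 @ beta2_fwl

print(beta_full[2:], beta2_fwl)
```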

Constrained estimation

Suppose it is known that the coefficients in the regression satisfy a system of linear equations


    H_0\!:\ \ Q'\beta = c, \,

where Q is a p×q matrix of full rank, and c is a q×1 vector of known constants, where q < p. In this case least squares estimation is equivalent to minimizing the sum of squared residuals of the model subject to the constraint H0. The constrained least squares (CLS) estimator can be given by an explicit formula:[22]


    \hat\beta^c = \hat\beta - (X'X)^{-1}Q\Big(Q'(X'X)^{-1}Q\Big)^{-1}(Q'\hat\beta - c)

This expression for the constrained estimator is valid as long as the matrix X′X is invertible. It was assumed from the beginning of this article that this matrix is of full rank, and it was noted that when the rank condition fails, β will not be identifiable. However it may happen that adding the restriction H0 makes β identifiable, in which case one would like to find the formula for the estimator. The estimator is equal to [23]


    \hat\beta^c = R(R'X'XR)^{-1}R'X'y + \Big(I_p - R(R'X'XR)^{-1}R'X'X\Big)Q(Q'Q)^{-1}c,

where R is a p×(p−q) matrix such that the matrix [Q R] is non-singular, and R′Q = 0. Such a matrix can always be found, although generally it is not unique. The second formula coincides with the first in the case when X′X is invertible.[23]
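The first of the two formulas in this section (the one that requires X′X to be invertible) can be sketched as below. The data are simulated, and the single restriction β2 + β3 = 1 is a hypothetical example chosen only to illustrate the computation.

```python
import numpy as np

# Hypothetical synthetic data: a constant plus two regressors.
rng = np.random.default_rng(6)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.7, 0.3]) + rng.normal(size=n)

# Hypothetical restriction Q'beta = c with q = 1: beta_2 + beta_3 = 1.
Q = np.array([[0.0], [1.0], [1.0]])   # p x q
c = np.array([1.0])                   # q-vector

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y          # unconstrained OLS

# Constrained least squares estimator.
A = Q.T @ XtX_inv @ Q
beta_c = beta_hat - XtX_inv @ Q @ np.linalg.solve(A, Q.T @ beta_hat - c)


def ssr(b):
    """Sum of squared residuals at b."""
    r = y - X @ b
    return r @ r


print(beta_c, Q.T @ beta_c)
```

The constrained estimate satisfies the restriction exactly, and its sum of squared residuals is never smaller than that of the unconstrained fit.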

Large sample properties

The least squares estimators are point estimates of the linear regression model parameters β. However generally we also want to know how close those estimates might be to the true values of parameters. In other words, we want to construct the interval estimates.

Since we haven't made any assumption about the distribution of the error term εi, it is impossible to infer the exact distribution of the estimators \hat\beta and \hat\sigma^2. Nevertheless, we can apply the law of large numbers and the central limit theorem to derive their asymptotic properties as the sample size n goes to infinity. In practice the sample size is fixed, but it is customary to assume that n is "large enough" so that the true distribution of the OLS estimator is close to its asymptotic limit, and the former may be approximately replaced by the latter.

We can show that under the model assumptions, the least squares estimator for β is consistent (that is, \hat\beta converges in probability to β) and asymptotically normal:

\sqrt{n}(\hat\beta - \beta)\ \xrightarrow{d}\ \mathcal{N}\big(0,\;\sigma^2Q_{xx}^{-1}\big),

where Q_{xx} = \operatorname{E}[x_ix'_i], which is consistently estimated by \hat{Q}_{xx} = X'X/n.

Using this asymptotic distribution, approximate two-sided confidence intervals for the j-th component of the vector β can be constructed as

\beta_j \in \bigg[\ 
    \hat\beta_j \pm q^{\mathcal{N}(0,1)}_{1-\alpha/2}\!\sqrt{\tfrac{1}{n}\hat\sigma^2\big[\hat{Q}_{xx}^{-1}\big]_{jj}}
    \ \bigg]   at the 1 − α confidence level,

where q denotes the quantile function of standard normal distribution, and [·]jj is the j-th diagonal element of a matrix.
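A sketch of this interval computation is given below. It assumes that \hat{Q}_{xx} is taken to be X′X/n and uses simulated data with deliberately non-normal (Student-t) errors; the seed, sample size, and variable names are illustrative assumptions.

```python
import numpy as np
from statistics import NormalDist

# Hypothetical synthetic data with non-normal errors.
rng = np.random.default_rng(7)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.standard_t(df=5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n          # estimate of sigma^2
Q_hat = X.T @ X / n                     # assumed estimate of Q_xx

alpha = 0.05
q = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile

# Half-width: q * sqrt( (1/n) * sigma2_hat * [Q_hat^{-1}]_jj )
half_width = q * np.sqrt(sigma2_hat * np.diag(np.linalg.inv(Q_hat)) / n)
ci_lower = beta_hat - half_width
ci_upper = beta_hat + half_width

print(np.column_stack([ci_lower, beta_hat, ci_upper]))
```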

Similarly, the least squares estimator for σ² is also consistent and asymptotically normal (provided that the fourth moment of εi exists) with limiting distribution

\sqrt{n}(\hat\sigma^2-\sigma^2)\ \xrightarrow{d}\ \mathcal{N}\big(0,\;\operatorname{E}[\varepsilon_i^4]-\sigma^4\big).

These asymptotic distributions can be used for prediction, testing hypotheses, constructing other estimators, etc. As an example consider the problem of prediction. Suppose x_0 is some point within the domain of distribution of the regressors, and one wants to know what the response variable would have been at that point. The mean response is the quantity y_0=x'_0\beta, whereas the predicted response is \hat{y}_0=x'_0\hat\beta. Clearly the predicted response is a random variable; its distribution can be derived from that of \hat\beta:

\sqrt{n}(\hat{y}_0 - y_0)\ \xrightarrow{d}\ \mathcal{N}\big(0,\;\sigma^2x'_0Q_{xx}^{-1}x_0\big),

which allows confidence intervals for the mean response y_0 to be constructed:

y_0\in\bigg[\ x_0'\hat\beta \pm q^{\mathcal{N}(0,1)}_{1-\alpha/2}\!\sqrt{\tfrac{1}{n}\hat\sigma^2x'_0\hat{Q}_{xx}^{-1}x_0}\ \bigg]   at the 1 − α confidence level.

Hypothesis testing

Example with real data

The following data set gives average heights and weights for American women aged 30–39 (source: The World Almanac and Book of Facts, 1975).

 Height (m):  1.47 1.50 1.52 1.55 1.57 1.60 1.63 1.65 1.68 1.70 1.73 1.75 1.78 1.80 1.83
 Weight (kg):  52.21 53.12 54.48 55.84 57.20 58.57 59.93 61.29 63.11 64.47 66.28 68.10 69.92 72.19 74.46

When only one dependent variable is being modeled, a scatterplot will suggest the form and strength of the relationship between the dependent variable and regressors. It might also reveal outliers, heteroscedasticity, and other aspects of the data that may complicate the interpretation of a fitted regression model. The scatterplot suggests that the relationship is strong and can be approximated as a quadratic function. OLS can handle non-linear relationships by introducing the regressor HEIGHT2. The regression model then becomes a multiple linear model:

w_i = \beta_1 + \beta_2 h_i + \beta_3 h_i^2 + \varepsilon_i.

The output from most popular statistical packages will look similar to this:

Method: Least Squares
Dependent variable: WEIGHT
Included observations: 15

Variable    Coefficient   Std. Error   t-statistic   p-value
const          128.8128      16.3083        7.8986    0.0000
HEIGHT        -143.1620      19.8332       -7.2183    0.0000
HEIGHT2         61.9603       6.0084       10.3122    0.0000

R²                     0.9989    S.E. of regression    0.2516
Adjusted R²            0.9987    Model sum-of-sq       692.61
Log-likelihood         1.0890    Residual sum-of-sq    0.7595
Durbin–Watson stat.    2.1013    Total sum-of-sq       693.37
Akaike criterion       0.2548    F-statistic           5471.2
Schwarz criterion      0.3964    p-value (F-stat)      0.0000

In this table, the adjusted R² is obtained from R² as

\overline{R}^2 = 1 - \tfrac{n-1}{n-p}(1-R^2)
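The coefficient estimates and several of the summary statistics in the output above can be reproduced from the listed data. The sketch below uses NumPy rather than a dedicated statistics package, so it is not how the table was originally produced; it recomputes only the quantities whose formulas appear in this article.

```python
import numpy as np

# Height (m) and weight (kg) data from the example.
height = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                   1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

# Regressors: constant, HEIGHT, HEIGHT squared.
X = np.column_stack([np.ones_like(height), height, height ** 2])
n, p = X.shape

beta_hat, *_ = np.linalg.lstsq(X, weight, rcond=None)
resid = weight - X @ beta_hat

ssr = resid @ resid                              # residual sum of squares
tss = np.sum((weight - weight.mean()) ** 2)      # total sum of squares
r2 = 1 - ssr / tss
adj_r2 = 1 - (n - 1) / (n - p) * (1 - r2)
s = np.sqrt(ssr / (n - p))                       # S.E. of regression
std_err = s * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))

print(beta_hat, std_err, r2, adj_r2, s)
```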

Ordinary least squares analysis often includes the use of diagnostic plots designed to detect departures of the data from the assumed form of the model.

An important consideration when carrying out statistical inference using regression models is how the data were sampled. In this example, the data are averages rather than measurements on individual women. The fit of the model is very good, but this does not imply that the weight of an individual woman can be predicted with high accuracy based only on her height.

Beware

This example also demonstrates that sophisticated calculations will not overcome the use of badly prepared data. The heights were originally given in inches, and have been converted to metres rounded to the nearest centimetre. Since one inch is exactly 2.54 cm, this rounding means the listed values are not an exact conversion. The original inches can be recovered by Round(x/0.0254) and then re-converted to metric without rounding: if this is done, the results become

    const     HEIGHT     HEIGHT2
 128.8128  -143.162     61.96033   incorrectly converted to metric
 119.0205  -131.5076    58.5046    correctly converted

Thus a seemingly small variation in the data has a real effect.
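The effect of the rounding can be reproduced by recovering the whole-inch heights and refitting. In the sketch below, the helper `quad_fit` is a hypothetical name introduced for illustration.

```python
import numpy as np

# Heights as listed (metres, rounded to the nearest centimetre) and weights (kg).
height_m = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                     1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])


def quad_fit(h):
    """OLS fit of weight on a constant, h and h squared."""
    X = np.column_stack([np.ones_like(h), h, h ** 2])
    b, *_ = np.linalg.lstsq(X, weight, rcond=None)
    return b


inches = np.round(height_m / 0.0254)   # recover the original whole inches
height_exact = inches * 0.0254         # convert back to metres without rounding

b_rounded = quad_fit(height_m)         # fit on the rounded heights
b_exact = quad_fit(height_exact)       # fit on the exactly converted heights

print(inches)
print(b_rounded)
print(b_exact)
```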

References

  1. ^ Hayashi (2000, page 7)
  2. ^ Hayashi (2000, page 187)
  3. ^ a b Hayashi (2000, page 10)
  4. ^ Hayashi (2000, page 34)
  5. ^ Hayashi (2000, page 15)
  6. ^ Hayashi (2000, page 18)
  7. ^ a b Hayashi (2000, page 19)
  8. ^ Hayashi (2000, page 20)
  9. ^ Hayashi (2000, page 5)
  10. ^ Hayashi (2000, page 49)
  11. ^ a b Hayashi (2000, page 52)
  12. ^ Hayashi (2000, pages 27, 30)
  13. ^ a b c Hayashi (2000, page 27)
  14. ^ Amemiya (1985, page 13)
  15. ^ Amemiya (1985, page 14)
  16. ^ Rao (1973, page 319)
  17. ^ Amemiya (1985, page 20)
  18. ^ Amemiya (1985, page 27)
  19. ^ a b Davidson & Mackinnon (1993, page 33)
  20. ^ Davidson & Mackinnon (1993, page 36)
  21. ^ Davidson & Mackinnon (1993, page 20)
  22. ^ Amemiya (1985, page 21)
  23. ^ a b Amemiya (1985, page 22)
  24. ^ Burnham, Kenneth P.; David Anderson (2002). Model Selection and Multi-Model Inference (2nd ed.). Springer. ISBN 0387953647. 

  • Amemiya, Takeshi (1985). Advanced econometrics. Harvard University Press. ISBN 0-674-00560-0. 
  • Davidson, Russell; Mackinnon, James G. (1993). Estimation and inference in econometrics. Oxford University Press. ISBN 978-0-19-506011-9. 
  • Greene, William H. (2002). Econometric analysis (5th ed.). New Jersey: Prentice Hall. ISBN 0-13-066189-9. http://bib.tiera.ru/DVD-010/Greene_W.H._Econometric_analysis_(2002)(5th_ed.)(en)(983s).pdf. Retrieved 2010-04-26. 
  • Hayashi, Fumio (2000). Econometrics. Princeton University Press. ISBN 0-691-01018-8. 
  • Rao, C.R. (1973). Linear statistical inference and its applications (2nd ed.). New York: John Wiley & Sons.